Link to GitHub repository - 094295_hw2
In this project we have two main tasks.
The first is to perform exploratory data analysis (EDA).
The second is to predict, for each image, the bounding box and the label of that box, across different experiments.
Note: we hide the code cells so that the notebook stays clean. The full code is included in the repository.
In this phase we explore the data: we perform a basic statistical analysis, visualize some images, and draw insights from them.
| | fileName | id | bbox | label | x | y | w | h | box_area | type | imageWidth | imageHeight |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 000193__[6, 70, 87, 85]__True.jpg | 000193 | [6, 70, 87, 85] | True | 6 | 70 | 87 | 85 | 7395 | train | 224 | 168 |
| 1 | 013731__[31, 94, 140, 162]__False.jpg | 013731 | [31, 94, 140, 162] | False | 31 | 94 | 140 | 162 | 22680 | train | 224 | 224 |
| 2 | 008110__[63, 70, 105, 102]__False.jpg | 008110 | [63, 70, 105, 102] | False | 63 | 70 | 105 | 102 | 10710 | train | 224 | 224 |
| 3 | 009097__[59, 18, 13, 11]__False.jpg | 009097 | [59, 18, 13, 11] | False | 59 | 18 | 13 | 11 | 143 | train | 148 | 224 |
| 4 | 000225__[72, 70, 80, 69]__True.jpg | 000225 | [72, 70, 80, 69] | True | 72 | 70 | 80 | 69 | 5520 | train | 224 | 169 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3995 | 016921__[51, 112, 128, 138]__True.jpg | 016921 | [51, 112, 128, 138] | True | 51 | 112 | 128 | 138 | 17664 | test | 224 | 224 |
| 3996 | 017740__[16, 96, 77, 73]__False.jpg | 017740 | [16, 96, 77, 73] | False | 16 | 96 | 77 | 73 | 5621 | test | 224 | 134 |
| 3997 | 016435__[50, 96, 147, 147]__False.jpg | 016435 | [50, 96, 147, 147] | False | 50 | 96 | 147 | 147 | 21609 | test | 213 | 224 |
| 3998 | 017186__[70, 71, 53, 64]__False.jpg | 017186 | [70, 71, 53, 64] | False | 70 | 71 | 53 | 64 | 3392 | test | 224 | 218 |
| 3999 | 019067__[177, 58, 27, 30]__False.jpg | 019067 | [177, 58, 27, 30] | False | 177 | 58 | 27 | 30 | 810 | test | 148 | 224 |
20000 rows × 12 columns
The `type` column splits the data into 16,000 train rows and 4,000 test rows.

[Figure: bar plot of the number of rows per `type` (train/test)]
| | box_area | imageWidth | imageHeight |
|---|---|---|---|
| count | 20000.0 | 20000.0 | 20000.0 |
| mean | 5250.0 | 199.0 | 203.0 |
| std | 5512.0 | 36.0 | 30.0 |
| min | -100.0 | 70.0 | 58.0 |
| 25% | 1155.0 | 168.0 | 179.0 |
| 50% | 3480.0 | 224.0 | 224.0 |
| 75% | 7480.0 | 224.0 | 224.0 |
| max | 49275.0 | 224.0 | 224.0 |
[Figure: histogram of box width]
[Figure: histogram of box height]
From the statistics and graphs above we can see that -
From the images above, we can see that the images vary considerably. The differences include -
From the images above, it seems that the bounding-box labels follow the definition of a proper mask, although the bounding-box annotations themselves are not perfect.
The main model & process we used are described below. Afterwards, we explain the changes we made for the second configuration.
Throughout our work we make use of the pytorch-lightning framework, which abstracts away a lot of the 'boilerplate' code usually involved in building & training neural networks.
It wraps the regular pytorch framework and makes it easy to use advanced features and to avoid mistakes.
Additionally, we make use of torchvision for handling the images.
The cleaning and preprocessing steps we perform are rather simple -
First, we parse the image information (label & bounding box location) from the image file name.
The bounding boxes are defined as [$x_1$, $y_1$, $w$, $h$]. However, the model we use expects the bounding boxes to be in [$x_1$, $y_1$, $x_2$, $y_2$] format. Therefore we correct for this mismatch by setting $x_2 = x_1 + w$ and $y_2 = y_1 + h$.
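For illustration, a minimal parsing helper along these lines could look as follows (the function name and exact splitting logic are ours, not necessarily the code in the repository; the file-name format is taken from the data table above):

```python
import ast
from pathlib import Path

def parse_filename(filename):
    # File names look like '000193__[6, 70, 87, 85]__True.jpg'
    # (format taken from the data table above).
    image_id, bbox_str, label_str = Path(filename).stem.split('__')
    x, y, w, h = ast.literal_eval(bbox_str)   # [x1, y1, w, h]
    box_xyxy = [x, y, x + w, y + h]           # -> [x1, y1, x2, y2]
    label = (label_str == 'True')             # proper-mask label
    return image_id, box_xyxy, label
```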
Before loading the images into the model, their pixel values are scaled to the range [0, 1].
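This scaling is exactly what torchvision's standard `ToTensor` transform performs, e.g.:

```python
from torchvision import transforms

# Converts a PIL image (uint8, values in [0, 255]) to a
# float tensor of shape CxHxW with values in [0, 1].
to_tensor = transforms.ToTensor()
```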
We note that as the images are of different sizes, we create a custom collate function that handles this mismatch properly.
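A common pattern for such a collate function with torchvision detection models is to simply keep the samples in tuples rather than stacking them (a sketch; the function name is ours):

```python
def collate_fn(batch):
    # Each element of `batch` is an (image, target) pair. Since the images
    # have different spatial sizes, we avoid stacking them into one tensor
    # and instead pass on a tuple of images and a tuple of targets.
    return tuple(zip(*batch))
```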
For loading the data into the model we make use of pytorch's datasets and dataloaders and of pytorch-lightning's datamodule. All of these help load the images into memory efficiently, as needed.
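A sketch of how these pieces could fit together (class and argument names here are illustrative, not necessarily those used in the repository):

```python
import pytorch_lightning as pl
from torch.utils.data import DataLoader

class MaskDataModule(pl.LightningDataModule):
    def __init__(self, train_dataset, test_dataset, batch_size=4):
        super().__init__()
        self.train_dataset = train_dataset
        self.test_dataset = test_dataset
        self.batch_size = batch_size

    def train_dataloader(self):
        # collate_fn is the custom function sketched above.
        return DataLoader(self.train_dataset, batch_size=self.batch_size,
                          shuffle=True, collate_fn=collate_fn)

    def val_dataloader(self):
        return DataLoader(self.test_dataset, batch_size=self.batch_size,
                          shuffle=False, collate_fn=collate_fn)
```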
For the model architecture we chose to use the widely popular Faster R-CNN.
It is a fast, end-to-end framework for object detection that uses deep convolutional networks.
The architecture was proposed in the game-changing article - Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks (Ren et al., 2015).
The architecture consists of three parts.
The first part extracts features from the raw image using a CNN module.
In our case we use Resnet18 for this module.
The second part is a Region Proposal Network (RPN). This is a small neural network that slides over the last feature map of the previous module and predicts whether there is an object in each area and, if so, also proposes bounding boxes for it.
The third part is a fully connected neural network (the ROI head) that takes the regions proposed by the RPN as input and predicts the object class (classification) and refined bounding boxes (regression).
Implementation-wise, we use torchvision's implementation of Faster R-CNN, with slight modifications as we describe below. Torchvision's implementation of Faster R-CNN has two operating modes - training and evaluation. When training, the model returns only the losses, and when evaluating - only the bounding boxes and predictions. Since we also need the outputs when training and the losses when evaluating, we modify the torchvision code. The modified code is hosted in a public GitHub fork of the original implementation.
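For illustration, here is a minimal sketch of how such a model can be assembled from torchvision's building blocks and how the two operating modes behave. The ResNet18 backbone matches the text; `num_classes=3` (background plus the two mask labels) is our assumption, and the API shown is that of older torchvision releases (newer ones use a `weights` argument instead of `pretrained`):

```python
import torch
from torchvision.models.detection import FasterRCNN
from torchvision.models.detection.backbone_utils import resnet_fpn_backbone

# Faster R-CNN with a ResNet18 (FPN) backbone, as described above.
backbone = resnet_fpn_backbone('resnet18', pretrained=True)
model = FasterRCNN(backbone, num_classes=3)  # background + 2 mask labels (our assumption)

# Dummy input: one 3x224x224 image with a single box in [x1, y1, x2, y2] format.
images = [torch.rand(3, 224, 224)]
targets = [{'boxes': torch.tensor([[6., 70., 93., 155.]]),
            'labels': torch.tensor([1])}]

model.train()
loss_dict = model(images, targets)  # training mode: only the losses are returned
model.eval()
detections = model(images)          # eval mode: only boxes, labels and scores
```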
As the architecture has multiple modules, all of which need to be trained, we make use of losses from both the RPN module and the ROI module. From the RPN module we get 2 losses -
- an objectness loss (whether an anchor contains an object or not), and
- a bounding-box regression loss for the proposed boxes.
These losses take into account the proposals.
Similarly, from the ROI module we get 2 losses -
- a classification loss for the predicted object class, and
- a bounding-box regression loss for the final boxes.
These losses take into account the final predicted bounding box.
Finally, we sum all 4 losses.
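In torchvision's implementation these four losses come back as a dictionary, so the summation is a one-liner. A sketch of a pytorch-lightning training step (the method body is illustrative; the loss keys are the ones torchvision uses):

```python
def training_step(self, batch, batch_idx):
    images, targets = batch
    loss_dict = self.model(images, targets)
    # loss_dict keys: 'loss_objectness', 'loss_rpn_box_reg' (RPN),
    #                 'loss_classifier', 'loss_box_reg' (ROI head)
    total_loss = sum(loss_dict.values())
    self.log('train_loss', total_loss)
    return total_loss
```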
We use the Adam optimizer with an initial learning rate of 1e-3 and otherwise default parameters. In this initial configuration there is no explicit regularization.
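In pytorch-lightning this corresponds to a `configure_optimizers` hook along the following lines (a sketch):

```python
def configure_optimizers(self):
    # Adam with the initial learning rate stated above; all other
    # parameters are left at their pytorch defaults.
    return torch.optim.Adam(self.parameters(), lr=1e-3)
```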
To select and evaluate the model with different hyperparameters, we train the model on the train set and verify the results on the validation set (referred to as the test images).
Some hyperparameters relate to the optimization process, and others to the model itself.